
[BACKPORT 25.10] Added isCompleted check in getNumSpotInterruptions#6806

Merged
pditommaso merged 3 commits into STABLE-25.10.x from 6802-deadlock-backport-25.10.x
Feb 9, 2026

Conversation

@munishchouhan
Collaborator

@munishchouhan munishchouhan commented Feb 5, 2026

This PR will backport #6805 to 25.10

The isCompleted check in getNumSpotInterruptions() prevents the deadlock because:

  • Spot interruption counts are only meaningful after task completion
  • The method returns null immediately for non-completed tasks
  • This avoids calling describeJob() and triggering the batching mechanism
  • The AWS Batch API is only queried when the task is completed and the spot interruption count is actually needed
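
The guard described in the list above can be sketched roughly as follows. This is an illustrative Python model, not the Nextflow (Groovy) code from the PR: the `TaskHandler` class, the `describe_job` callable, and the `spot_interrupted` field are hypothetical names chosen for the example, and `None` stands in for `null`.

```python
from typing import Callable, Optional


class TaskHandler:
    """Hypothetical stand-in for a Batch task handler (not Nextflow's class)."""

    def __init__(self, describe_job: Callable[[], dict]):
        # describe_job models the remote "describe job" call that the
        # guard is meant to avoid while the task is still running.
        self._describe_job = describe_job
        self.completed = False
        self.describe_calls = 0  # counts how often the remote call is made

    def is_completed(self) -> bool:
        return self.completed

    def get_num_spot_interruptions(self) -> Optional[int]:
        # Guard: the count is only meaningful once the task has finished,
        # so return early and skip the remote call otherwise.
        if not self.is_completed():
            return None
        self.describe_calls += 1
        job = self._describe_job()
        # Assumed response shape: a list of attempts, each flagged if it
        # ended because of a spot interruption.
        return sum(
            1 for attempt in job.get("attempts", [])
            if attempt.get("spot_interrupted")
        )
```

With this shape, calling `get_num_spot_interruptions()` on a task that is still pending or running never reaches `describe_job`, so a trace record built at submission time cannot block on the remote API.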

Signed-off-by: munishchouhan <hrma017@gmail.com>
@pditommaso
Member

What about the version in master?

@munishchouhan
Collaborator Author

What about the version in master?

There is another PR for that: #6805

@thalassemia

I want to add that this was throttling the task submission rate to about one task every few seconds when using the Google Batch executor. Specifically, notifyTaskSubmit() --> getTraceRecord() --> getNumSpotInterruptions() --> describeTask() --> getTaskStatus() --> apply() would repeatedly fail with a NotFoundException because Batch had not finished creating the job. This triggered the retry policy until the job came online or five attempts were made. Thanks for the quick fix!
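
To illustrate why this slows submission, here is a rough Python model of the retry behaviour described in the comment above. It is a sketch under assumptions, not the executor's code: `NotFoundError`, `call_with_retry`, and `make_describe_task` are hypothetical names, and the delay between attempts is a placeholder. Only the five-attempt limit comes from the comment.

```python
class NotFoundError(Exception):
    """Hypothetical stand-in for the NotFoundException mentioned above."""


def call_with_retry(fn, max_attempts=5, delay=0.0, sleep=lambda seconds: None):
    """Call fn until it succeeds or max_attempts is reached.

    Returns (result, attempts_used). The delay value and the injected
    sleep function are assumptions made for this illustration.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(), attempt
        except NotFoundError as err:
            last_error = err
            if attempt < max_attempts:
                sleep(delay)  # each failed attempt stalls the submitting caller
    raise last_error


def make_describe_task(ready_after):
    """Build a fake describe call that fails until the job 'exists'."""
    calls = {"n": 0}

    def describe_task():
        calls["n"] += 1
        if calls["n"] <= ready_after:
            raise NotFoundError("job not created yet")
        return "RUNNING"

    return describe_task
```

If every submission runs through such a loop before the job exists, each task pays for the failed attempts and the waits between them, which matches the throttling described. Returning early for non-completed tasks means the loop is never entered at submission time.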

@pditommaso pditommaso merged commit 21390a1 into STABLE-25.10.x Feb 9, 2026
9 of 19 checks passed
@pditommaso pditommaso deleted the 6802-deadlock-backport-25.10.x branch February 9, 2026 18:56
3 participants